Basic dataviz with R and ggplot2



This practical will lead you through the basics of ggplot2. It is split into 4 parts, each dedicated to a big chart family: correlation, distribution, ranking and evolution.

Get ready


This practical will require several R packages. Here is how to install the ggplot2 package:

# Install the package if needed
#install.packages("ggplot2")

# Load it
library(ggplot2)


Q0.1 - Install and load the following library, which will be useful throughout the practical: dplyr.

# Install the package if needed
#install.packages("dplyr")

# Load it
library(dplyr)

1- Correlation


The first part of this practical will guide you through common practice for the visualization of correlation. It will cover chart types like scatterplots, bubble plots, 2d density charts and others.


Q1.1 - Load the gapminder dataset stored in the gapminder package. Have a look at the first 6 rows using the head() function. Briefly describe what you see as comments in your script. Check how many rows are available with nrow().

# Install the package if needed
#install.packages("gapminder")

# Load it
library(gapminder)

# Have a look at the first rows
head(gapminder)
# How many rows?
nrow(gapminder)
## [1] 1704


Q1.2 - How many years are available in this dataset? How many data points for each year? Same question for country and continent. Use the nlevels() function to get the number of levels of a factor. Use the table() function to see the occurrence of each level.

# Number of different years?
gapminder %>%
  select(year) %>%
  unique() %>%
  nrow()
## [1] 12
# or
length(unique(gapminder$year))
## [1] 12
# Number of countries available per year?
gapminder %>%
  group_by(year) %>%
  summarize( n=n() )
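
The question also asks about country and continent. A minimal sketch of one way to answer it with nlevels() and table(), assuming (as is the case in the gapminder package) that both columns are stored as factors. The variable names are only illustrative:

```r
library(gapminder)

# country and continent are factors, so nlevels() returns the number of distinct values
n_country   <- nlevels(gapminder$country)
n_continent <- nlevels(gapminder$continent)
n_country
n_continent

# table() counts how many rows (data points) fall in each level
continent_counts <- table(gapminder$continent)
continent_counts

# year is an integer column, not a factor, but table() still works on it
year_counts <- table(gapminder$year)
year_counts
```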


Q1.3 - Build a scatterplot showing the relationship between gdpPercap and lifeExp in 1952. What do you observe?

# basic scatterplot
gapminder %>%
  filter(year==1952) %>%
  ggplot( aes(x=gdpPercap, y=lifeExp)) +
    geom_point()


Q1.4 - On the previous chart, one country is very different. Which one is it?

# Which country has an unusually high GDP per capita?
gapminder %>%
  filter(year==1952 & gdpPercap>90000)


Q1.5 - Build the same chart, but get rid of this country. What trend do you observe? Does it make sense? What’s missing? What could be better?

# basic scatterplot
gapminder %>%
  filter(year==1952 & country!="Kuwait") %>%
  ggplot( aes(x=gdpPercap, y=lifeExp)) +
    geom_point()
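
One possible improvement, offered as a sketch rather than as the expected answer: GDP per capita is strongly right-skewed, so most points pile up on the left of the chart. A log-scaled x axis (scale_x_log10()) spreads them out, and labs() adds the missing title and axis labels. The label wording and the name p_log are assumptions.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

p_log <- gapminder %>%
  filter(year == 1952, country != "Kuwait") %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
    geom_point() +
    scale_x_log10() +   # log scale spreads out the low-GDP countries
    labs(
      title = "Life expectancy vs GDP per capita, 1952",
      x = "GDP per capita (log scale)",
      y = "Life expectancy (years)"
    )
p_log
```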


Q1.6 - Color dots according to their continent. In the aes() part of the code, use the color argument.

gapminder %>%
  filter(year==1952 & country!="Kuwait") %>%
  ggplot( aes(x=gdpPercap, y=lifeExp, color=continent)) +
    geom_point()


Q1.7 - Let’s observe an additional variable: make the circle size proportional to the population (pop). This is done with the size argument of aes(). What is this kind of chart called?

gapminder %>%
  filter(year==1952 & country!="Kuwait") %>%
  ggplot( aes(x=gdpPercap, y=lifeExp, color=continent, size=pop)) +
    geom_point()


Bonus - Ahead of schedule? Try the following:

  • customize the general appearance using the theme_ipsum() theme of the hrbrthemes library.
  • add transparency to the circles to limit the impact of overlapping, with the alpha argument of geom_point()
  • sort your data by decreasing pop so that small circles are drawn on top instead of being hidden by big bubbles
  • use the ggplotly() function of the plotly package to make this chart interactive

# Additional packages:
library(hrbrthemes) # for general style
library(plotly)     # to make the chart interactive

# Chart
p <- gapminder %>%
  filter(year==1952 & country!="Kuwait") %>%
  arrange(desc(pop)) %>%
  ggplot( aes(x=gdpPercap, y=lifeExp, fill=continent, size=pop)) +
    # shape 21 is a filled circle: fill sets the inside, color the outline
    geom_point(alpha=0.7, color="white", shape=21) +
    theme_ipsum()

# Interactive version
ggplotly(p)

2- Distribution


This second part is dedicated to the visualization of distributions. It is split into 2 parts:

  • Visualizing one distribution
  • Comparing distribution for several groups or variables



2.1 - One distribution

The example dataset provides the Airbnb nightly prices of ~1000 apartments on the French Riviera. Data is stored on GitHub and can be loaded in R as follows:

# Load dataset from github
data <- read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/1_OneNum.csv", header=TRUE)


Q2.1.1 - How many rows are in the dataset? (use nrow()) What is the min? The max? (use summary()). Do you see anything strange? What kind of chart would you build to visualize this kind of data?

# Number of rows
nrow(data)
# Min, max and quartiles of the price
summary(data)


Q2.1.2 - Build a histogram of the data with geom_histogram(). Are you happy with the output? How can we improve it?

data %>%
  ggplot( aes(x=price)) +
    geom_histogram()


Q2.1.3 - Build a histogram excluding prices over 1500 euros. ggplot2 displays a warning message, why? What does it mean? What’s the main caveat of histograms?

data %>%
  filter(price<1500) %>%
  ggplot( aes(x=price)) +
    geom_histogram()
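
The note ggplot2 prints here concerns the number of bins: when neither bins nor binwidth is given, geom_histogram() falls back to 30 bins. Below is a sketch of setting the bin width explicitly, which removes the note. To stay runnable offline it uses gdpPercap from gapminder as a stand-in for the price column. The bin width of 2500 and the name p_hist are arbitrary assumptions.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

p_hist <- gapminder %>%
  filter(year == 2007) %>%
  ggplot(aes(x = gdpPercap)) +
    # binwidth is expressed in the units of the x variable;
    # bins = 20 would fix the number of bins instead
    geom_histogram(binwidth = 2500, fill = "#69b3a2", color = "white")
p_hist
```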


Q2.1.4 - Build the histogram with different values of binwidth, for prices <400. What do you observe? Is it important to play with this parameter?

data %>%
  filter(price<400) %>%
  ggplot( aes(x=price)) +
    geom_histogram(binwidth = 2)


Q2.1.5 - Use geom_density() to build a density chart. Use the fill argument to set the color. Use the help() function to find the equivalent of binwidth for a density chart, then check its effect using different values.

data %>%
  filter(price<1000) %>%
  ggplot( aes(x=price)) +
    geom_density(color="transparent", fill="#69b3a2", bw=5)
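
For density charts the smoothing is controlled by the bandwidth: bw sets it directly, while adjust multiplies the default bandwidth. Here is a small sketch comparing a wiggly and a smooth estimate, again using gapminder (lifeExp in 2007) as an offline stand-in for the price data. The values 0.3 and 2 are arbitrary.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

life_2007 <- gapminder %>% filter(year == 2007)

# adjust < 1 narrows the bandwidth: more detail, more noise
p_narrow <- ggplot(life_2007, aes(x = lifeExp)) +
  geom_density(fill = "#69b3a2", color = "transparent", adjust = 0.3)

# adjust > 1 widens it: smoother curve, but real bumps may disappear
p_wide <- ggplot(life_2007, aes(x = lifeExp)) +
  geom_density(fill = "#69b3a2", color = "transparent", adjust = 2)

p_narrow
p_wide
```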



2.2 - Several distributions

Dataset: people were asked questions like "What probability would you assign to the phrase 'Highly likely'?". Answers were given in the range 0-100. This helps to understand how people perceive probability vocabulary. Data is stored on GitHub and can be loaded in R as follows:

# Load dataset from github
data <- read.table("https://raw.githubusercontent.com/holtzy/Teaching/master/DATA/probability.csv", header=TRUE, sep=",")


Q2.2.1 - As usual, check data main features with nrow(), summary() or any other function you think is useful.

# Data size?
nrow(data)

# occurrence of each word:
table(data$text)


Q2.2.2 - What kind of chart would allow you to compare the 8 categories?


Q2.2.3 - Build a basic boxplot using the default options of geom_boxplot()

ggplot(data, aes(x=text, y=value, fill=text)) +
    geom_boxplot() 


Q2.2.4 - What do you observe? Can you improve this chart? What would you change? Do you remember what the different parts of the box mean?
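
As a reminder of what geom_boxplot() draws by default: the box spans the first to the third quartile, the middle line is the median, the whiskers reach the most extreme points within 1.5 × IQR of the box, and anything beyond is drawn as an individual point. Below is a minimal sketch that computes these values by hand on a made-up vector (x is purely illustrative and not part of the dataset):

```r
# Made-up values with one obvious outlier
x <- c(2, 4, 5, 7, 8, 9, 11, 12, 40)

q <- quantile(x, probs = c(0.25, 0.5, 0.75))   # box edges and median
iqr <- q[[3]] - q[[1]]                          # height of the box

# Fences: whiskers cannot go beyond these limits
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[3]] + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences
whisker_low  <- min(x[x >= lower_fence])
whisker_high <- max(x[x <= upper_fence])

# Points outside the fences are plotted individually
outliers <- x[x < lower_fence | x > upper_fence]

q
c(whisker_low, whisker_high)
outliers
```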


Q2.2.5 - Apply the following modifications to the previous boxplot:

  • order groups (increasing order).
  • flip X and Y axis (coord_flip())
  • remove legend (theme)

# Library forcats to reorder data
library(forcats)

# Reorder data
data %>%
  mutate(text = fct_reorder(text, value, .fun = median)) %>%
  ggplot(aes(x=text, y=value, fill=text)) +
    geom_boxplot() +
    theme(
      legend.position = "none"
    ) +
    coord_flip()


Q2.2.6 - What is the main caveat of boxplots? How can we do better?

# Library forcats to reorder data
library(forcats)

# Reorder data and overlay the individual observations
data %>%
  mutate(text = fct_reorder(text, value, .fun = median)) %>%
  ggplot(aes(x=text, y=value, fill=text)) +
    geom_boxplot() +
    geom_jitter(color="grey") +
    theme(
      legend.position = "none"
    ) +
    coord_flip()


Bonus - Ahead of schedule? Try the following:

  • build a violin plot with geom_violin()
  • search the internet to build a ridgeline chart.
  • find out how to add a red circle to represent the mean of each group
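
A possible starting point for the first and third items, given as a sketch only. To stay runnable offline it uses lifeExp by continent from gapminder (year 2007) instead of the probability dataset. stat_summary() computes the mean of each group and draws it as a red point. The point size and the name p_violin are arbitrary choices.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

p_violin <- gapminder %>%
  filter(year == 2007) %>%
  ggplot(aes(x = continent, y = lifeExp, fill = continent)) +
    geom_violin() +
    # one red point per group, placed at the group mean
    stat_summary(fun = mean, geom = "point", color = "red", size = 3) +
    theme(legend.position = "none")
p_violin
```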

3- Ranking


Let’s look at the quantity of weapons exported by the 50 largest exporters in 2017. The dataset is stored on GitHub and can be loaded in R as follows:

# Load dataset from github
data <- read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/7_OneCatOneNum.csv", header=TRUE, sep=",")


Q3.1 - What kind of chart can you build with this dataset? Which one would be the best in your opinion?


Q3.2 - Start with a basic barplot using geom_bar().

data %>%
  ggplot( aes(x=Country, y=Value) ) +
    geom_bar(stat="identity", fill="#69b3a2")


Q3.3 - The previous barplot is a bit disappointing, isn’t it? What can you improve? Do it!

data %>%
  filter(!is.na(Value)) %>%
  arrange(Value) %>%
  mutate(Country=factor(Country, Country)) %>%
  ggplot( aes(x=Country, y=Value) ) +
    geom_bar(stat="identity", fill="#69b3a2") +
    coord_flip() +
    theme(
      legend.position="none"
    ) +
    xlab("")
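
A lollipop chart is another option for ranking data. It is sketched here on the 20 most populated countries of gapminder in 2007 so that the example runs offline. The dataset swap and the names top20 and p_lollipop are assumptions. The ordering trick is the same as above: sort the rows, then freeze that order as factor levels.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

top20 <- gapminder %>%
  filter(year == 2007) %>%
  arrange(desc(pop)) %>%
  head(20) %>%
  arrange(pop) %>%
  # freeze the sorted order as factor levels so ggplot2 keeps it
  mutate(country = factor(as.character(country), levels = as.character(country)))

p_lollipop <- ggplot(top20, aes(x = country, y = pop)) +
  # the stem of each lollipop goes from 0 to the value
  geom_segment(aes(xend = country, y = 0, yend = pop), color = "grey") +
  geom_point(color = "#69b3a2", size = 3) +
  coord_flip() +
  xlab("")
p_lollipop
```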

4- Evolution


Let’s consider the evolution of the bitcoin price between April 2013 and April 2018.

# Load dataset from github
data <- read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/3_TwoNumOrdered.csv", header=TRUE)
# Convert the date column from text to the Date class
data$date <- as.Date(data$date)

# Basic line chart
data %>%
  ggplot( aes(x=date, y=value)) +
    geom_line(color="#69b3a2")

# Same data as an area chart
data %>%
  ggplot( aes(x=date, y=value)) +
    geom_area(color="#69b3a2", fill="#69b3a2")

# Zoom on the last 10 observations, combining area, line and points
data %>%
  tail(10) %>%
  ggplot( aes(x=date, y=value)) +
    geom_area(fill="#69b3a2", alpha=0.5) +
    geom_line(color="#69b3a2") +
    geom_point()
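
Because the date column was converted with as.Date(), ggplot2 treats the x axis as a date scale, and scale_x_date() can control its breaks and labels. Below is a small sketch on the economics dataset bundled with ggplot2, used here only so that the example runs offline. The 10-year break and the name p_dates are arbitrary choices.

```r
library(ggplot2)

# economics ships with ggplot2; its date column is already of class Date
p_dates <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line(color = "#69b3a2") +
  # one tick every 10 years, labelled with the 4-digit year
  scale_x_date(date_breaks = "10 years", date_labels = "%Y")
p_dates
```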

 




A practical by Yan Holtz

yan.holtz.data@gmail.com